Exploratory Analysis of table actor_name¶

Filter unwanted columns¶

According to the wiki page, we can drop the following columns:

  • name_type
  • name_number
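
The filtering step can be sketched with pandas as follows. This is a minimal illustration, not the notebook's original cell: the `actor_name` DataFrame and its sample values are hypothetical stand-ins, and only the two dropped column names come from the list above.

```python
import pandas as pd

# Hypothetical stand-in for the actor_name table; only the column names matter here.
actor_name = pd.DataFrame({
    "pk_actor_name": [10359, 52157],
    "name": ["Willaume", "Saletta"],
    "name_type": [None, None],
    "name_number": [None, None],
})

UNWANTED_COLUMNS = ["name_type", "name_number"]

# errors="ignore" keeps the call safe if a column has already been removed.
filtered = actor_name.drop(columns=UNWANTED_COLUMNS, errors="ignore")
print(list(filtered.columns))
```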

Table extract¶

(index) | pk_actor_name | concat_acna | is_standard_name | lang_iso | name | first_name | ordinal_text | ordinal_num | particle | title | ... | creator | creation_time | modifier | modification_time | concat_name | fk_abob_name_type | begin_month | begin_day | end_month | end_day
10051 | 10359 | AcNa10359 | True |  | Willaume | Fernand | None | NaN | None | None | ... | 11.0 | 2008-07-18 18:43:44.000 | 3.0 | 2013-02-14 11:00:03 | Willaume, Fernand | NaN | NaN | NaN | NaN | NaN
51293 | 52157 | AcNa52157 | True | ita | Saletta | Giovanni Alessandro | None | NaN | None | None | ... | 30.0 | 2014-03-29 15:20:27.990 | 30.0 | 2014-03-29 16:09:32 | Saletta, Giovanni Alessandro - da Chiari | 1058.0 | NaN | NaN | NaN | NaN
56773 | 57724 | AcNa57724 | True | deu | Lange | Christian | None | NaN | None | None | ... | 3.0 | 2014-09-11 22:47:27.880 | NaN | NaT | Lange, Christian | 1058.0 | NaN | NaN | NaN | NaN
1813 | 2088 | AcNa2088 | True |  | Castellano | Fernando | None | NaN | None | None | ... | 27.0 | 2008-11-09 09:24:28.000 | 3.0 | 2013-02-14 11:00:03 | Castellano, Fernando | NaN | NaN | NaN | NaN | NaN
11487 | 11804 | AcNa11804 | True |  | Bailhache | None | None | NaN | None | None | ... | 2.0 | 2008-07-19 00:14:18.000 | 3.0 | 2013-02-14 11:00:03 | Bailhache | NaN | NaN | NaN | NaN | NaN

5 rows × 28 columns

Discovery¶

Columns contain:
Total number of rows: 67293
  -      "pk_actor_name":   0.00% empty - 67293 (100.00%) uniques (eg: 49829; 49830; 49832)
  -        "concat_acna":   0.00% empty - 67293 (100.00%) uniques (eg: AcNa49829; AcNa49830; AcNa49832)
  -   "is_standard_name":   0.00% empty -     2 (  0.00%) uniques (eg: True; False)
  -        "concat_name":   0.00% empty - 63642 ( 94.57%) uniques (eg: Otte, Bern...; Staud, Joh...; Roma, Giul...)
  -      "creation_time":   0.00% empty - 40469 ( 60.14%) uniques (eg: 2013-02-20...; 2013-02-20...; 2013-02-20...)
  -           "fk_actor":   0.00% empty - 61555 ( 91.47%) uniques (eg: 46706; 46707; 46709)
  -            "creator":   0.00% empty -    89 (  0.13%) uniques (eg: 48.0; 3.0; 41.0)
  -               "name":   3.55% empty - 32301 ( 48.00%) uniques (eg: Otte; Staud; Roma)
  -           "lang_iso":   4.20% empty -    27 (  0.04%) uniques (eg: None; ita;    )
  -           "modifier":   7.31% empty -    88 (  0.13%) uniques (eg: 48.0; 3.0; 116.0)
  -         "first_name":   7.88% empty - 12315 ( 18.30%) uniques (eg: Bernhard; Johann; Giulio)
  -  "modification_time":  24.75% empty -  4689 (  6.97%) uniques (eg: NaT; 2013-02-14...; 2013-02-20...)
  -  "fk_abob_name_type":  70.46% empty -     8 (  0.01%) uniques (eg: nan; 1058.0; 1060.0)
  -              "notes":  86.47% empty -   420 (  0.62%) uniques (eg: None; ; Se fait ap...)
  - "comment_begin_year":  87.64% empty -    25 (  0.04%) uniques (eg: None; ; En septemb...)
  -   "comment_end_year":  87.68% empty -    12 (  0.02%) uniques (eg: None; ; Nom parfoi...)
  -         "apposition":  95.00% empty -  1892 (  2.81%) uniques (eg: None; Acquanegra; Loyola)
  -        "preposition":  95.52% empty -    37 (  0.05%) uniques (eg: None; dit de; de)
  -           "particle":  95.63% empty -   115 (  0.17%) uniques (eg: None; d'; van)
  -              "title":  98.37% empty -   229 (  0.34%) uniques (eg: None; d'; de)
  -         "begin_year":  98.74% empty -   279 (  0.41%) uniques (eg: 1883.0; 1882.0; nan)
  -           "end_year":  99.49% empty -   210 (  0.31%) uniques (eg: 1933.0; 1939.0; nan)
  -       "ordinal_text":  99.70% empty -    28 (  0.04%) uniques (eg: None; VIII; III)
  -        "ordinal_num":  99.90% empty -     9 (  0.01%) uniques (eg: nan; 8.0; 1.0)
  -        "begin_month":  99.98% empty -     9 (  0.01%) uniques (eg: nan; 9.0; 3.0)
  -          "begin_day":  99.99% empty -     9 (  0.01%) uniques (eg: nan; 17.0; 7.0)
  -          "end_month":  99.99% empty -     5 (  0.01%) uniques (eg: nan; 2.0; 7.0)
  -            "end_day":  99.99% empty -     5 (  0.01%) uniques (eg: nan; 15.0; 16.0)
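
A per-column summary of this kind (share of empty values and number of unique values) could be computed roughly as below. This is a sketch under stated assumptions, not the code that produced the listing above: the function name `summarize_columns` is hypothetical, and treating whitespace-only strings as empty is an assumption, since the original definition of "empty" is not shown.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the empty percentage and unique count of each column."""
    total = len(df)
    rows = []
    for col in df.columns:
        series = df[col]
        # Assumption: missing values and blank/whitespace-only strings count as empty.
        is_blank = series.map(lambda v: isinstance(v, str) and v.strip() == "")
        empty = series.isna() | is_blank
        n_unique = series.nunique(dropna=False)  # missing values count as a value
        rows.append({
            "column": col,
            "empty_pct": float(100.0 * empty.sum() / total) if total else 0.0,
            "n_unique": int(n_unique),
            "unique_pct": float(100.0 * n_unique / total) if total else 0.0,
        })
    # Least empty columns first, in the same order as the listing.
    summary = pd.DataFrame(rows).sort_values("empty_pct", kind="stable")
    return summary.reset_index(drop=True)

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    "lang_iso": ["ita", None, "   ", "deu"],
    "ordinal_num": [float("nan"), 8.0, float("nan"), 1.0],
    "pk_actor_name": [1, 2, 3, 4],
})
summary = summarize_columns(sample)
print(summary)
```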

Type parsing¶

Based on the table above, we parse each column into its most meaningful type.
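
As an illustration, the parsing might look like the sketch below. The choice of target types is an assumption based on the example values in the discovery listing (identifiers and years stored as floats, timestamps stored as text); the `raw` DataFrame is a hypothetical sample, not the real table.

```python
import pandas as pd

# Hypothetical sample with the same kinds of values as the discovery listing.
raw = pd.DataFrame({
    "pk_actor_name": [10359, 57724],
    "is_standard_name": [True, True],
    "creator": [11.0, 3.0],
    "modifier": [3.0, float("nan")],
    "creation_time": ["2008-07-18 18:43:44.000", "2014-09-11 22:47:27.880"],
    "modification_time": ["2013-02-14 11:00:03", None],
    "begin_year": [float("nan"), 1883.0],
})

# Assumed mapping of columns to types; adjust to the real table.
INTEGER_COLUMNS = ["creator", "modifier", "begin_year"]
DATETIME_COLUMNS = ["creation_time", "modification_time"]

parsed = raw.copy()
for col in INTEGER_COLUMNS:
    # The nullable "Int64" dtype keeps missing values instead of forcing floats.
    parsed[col] = parsed[col].astype("Int64")
for col in DATETIME_COLUMNS:
    # Unparseable timestamps become NaT rather than raising.
    parsed[col] = pd.to_datetime(parsed[col], errors="coerce")

print(parsed.dtypes)
```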

Columns analysis¶

Here we report notable findings for selected columns. The analysis is not exhaustive.

For some columns, we also update the values.

begin_date & end_date¶

We create two new columns, begin_date and end_date, by joining begin_year, begin_month, begin_day and end_year, end_month, end_day respectively.
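
One possible way to join the parts is sketched below. The output format is an assumption, since the original does not say how partial dates are handled: here the result is a text value of the form YYYY, YYYY-MM or YYYY-MM-DD, truncated at the first missing part, and missing when the year is missing. The helper `join_date_parts` and the sample values are hypothetical.

```python
import pandas as pd

def join_date_parts(year, month, day):
    """Join year/month/day into 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
    if pd.isna(year):
        return None  # no usable date without a year
    text = f"{int(year):04d}"
    if pd.isna(month):
        return text
    text += f"-{int(month):02d}"
    if pd.isna(day):
        return text
    return text + f"-{int(day):02d}"

nan = float("nan")
# Hypothetical sample rows, for illustration only.
names = pd.DataFrame({
    "begin_year": [1883.0, nan, 1882.0],
    "begin_month": [9.0, nan, nan],
    "begin_day": [17.0, nan, nan],
    "end_year": [1933.0, 1939.0, nan],
    "end_month": [2.0, nan, nan],
    "end_day": [15.0, nan, nan],
})

for prefix in ("begin", "end"):
    parts = zip(
        names[f"{prefix}_year"],
        names[f"{prefix}_month"],
        names[f"{prefix}_day"],
    )
    names[f"{prefix}_date"] = [join_date_parts(y, m, d) for y, m, d in parts]

print(names[["begin_date", "end_date"]])
```

Text values are used here rather than pandas timestamps because month and day are missing for almost all rows, and a timestamp cannot represent a year-only date.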

creation_time¶

creator¶

lang_iso¶

This column is cleaned to conform to ISO 639-2/T (three-letter codes, with the native-name form preferred, e.g. 'deu' instead of 'ger').
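
A minimal sketch of such a cleaning step is shown below. It is not the notebook's original code: the helper `clean_lang_iso` is hypothetical, the mapping covers only a few of the ISO 639-2 bibliographic (B) codes that differ from their terminologic (T) equivalents, and treating blank values as missing is an assumption.

```python
import pandas as pd

# Partial mapping from ISO 639-2/B codes to their ISO 639-2/T equivalents.
B_TO_T = {
    "ger": "deu",
    "fre": "fra",
    "dut": "nld",
    "gre": "ell",
    "cze": "ces",
    "rum": "ron",
}

def clean_lang_iso(value):
    """Normalise a language code: trim, lowercase, map B codes to T codes."""
    if pd.isna(value):
        return None
    code = str(value).strip().lower()
    if not code:
        return None  # assumption: whitespace-only values are treated as missing
    return B_TO_T.get(code, code)

# Hypothetical sample values, for illustration only.
lang_iso = pd.Series(["ger", "ita", "   ", None, " DEU "])
cleaned = [clean_lang_iso(v) for v in lang_iso]
print(cleaned)
```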

notes¶

All HTML tags, non-ASCII characters and newlines are removed.
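
The cleaning could be sketched as follows, using only the standard library. This is an assumed implementation: the helper `clean_note` and the sample strings are hypothetical, tags are matched with a simple regular expression rather than an HTML parser, and newlines are replaced by a space (rather than deleted outright) so that adjacent words do not run together.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")       # naive match for HTML tags
NEWLINE_RE = re.compile(r"[\r\n]+")   # any run of line breaks
SPACES_RE = re.compile(r" {2,}")      # repeated spaces left by the steps above

def clean_note(text):
    """Strip HTML tags, newlines and non-ASCII characters from a note."""
    if text is None:
        return None
    text = TAG_RE.sub("", text)
    text = NEWLINE_RE.sub(" ", text)
    # Encoding to ASCII with errors="ignore" silently drops non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return SPACES_RE.sub(" ", text).strip()

print(clean_note("<p>Se fait appeler\nJean</p>"))
print(clean_note("<b>café</b>"))
```

Note that dropping non-ASCII characters also removes accented letters, so a word such as "café" becomes "caf".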